Adversarial Blocking Bandits
We consider a general adversarial multi-armed blocking bandit setting in which each played arm can be blocked (made unavailable) for some number of time periods, and the reward of each arm is set adversarially at each time period, without obeying any distribution. The setting models the allocation of scarce, reusable supplies (modelled as arms) that replenish and can be used again only after certain time periods. We first show that, in the optimization setting, when the blocking durations and rewards are known in advance, finding an optimal policy (i.e., which arm to play in each round) that maximises the cumulative reward is strongly NP-hard, ruling out a fully polynomial-time approximation scheme (FPTAS) for the problem unless P = NP. To complement this result, we show that a greedy algorithm that plays the best available arm at each round provides an approximation guarantee that depends on the blocking durations and the path variation of the rewards. In the bandit setting, when the blocking durations and rewards are not known, we design two algorithms, RGA and RGA-META, for the case of bounded blocking durations and bounded path variation.
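As a concrete illustration of the greedy policy described in the abstract, here is a minimal sketch for the offline (optimization) setting, where per-round rewards and blocking durations are known in advance. All function and variable names are illustrative assumptions, not taken from the paper.

```python
# Greedy policy for the offline blocking bandit problem: at each round,
# play the available arm with the highest reward, then block it for its
# given duration. A simplified sketch; rewards and durations are assumed
# known in advance (the optimization setting, not the bandit setting).

def greedy_blocking(rewards, durations):
    """rewards[t][k]: reward of arm k at round t (set adversarially).
    durations[t][k]: rounds arm k stays blocked if played at round t.
    Returns the total reward collected and the sequence of arms played."""
    T = len(rewards)
    K = len(rewards[0])
    free_at = [0] * K          # first round at which each arm is available again
    total, plays = 0.0, []
    for t in range(T):
        available = [k for k in range(K) if free_at[k] <= t]
        if not available:
            plays.append(None)  # every arm is blocked this round
            continue
        k = max(available, key=lambda a: rewards[t][a])
        total += rewards[t][k]
        free_at[k] = t + durations[t][k]  # arm k blocked until this round
        plays.append(k)
    return total, plays
```

When every duration equals 1, this reduces to playing the per-round argmax; longer blocking durations are what couple the rounds together and make the offline problem strongly NP-hard.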
Efficient Restarts in Non-Stationary Model-Free Reinforcement Learning
Hiroshi Nonaka, Simon Ambrozak, Sofia R. Miskala-Dinc, Amedeo Ercole, Aviva Prins
In this work, we propose three efficient restart paradigms for model-free non-stationary reinforcement learning (RL). We identify two core issues with the restart design of the RestartQ-UCB algorithm of Mao et al. (2022): (1) complete forgetting, where all information learned about the environment is lost after a restart, and (2) scheduled restarts, where restarts occur only at predefined times, regardless of how incompatible the current policy is with the environment dynamics. We introduce three approaches, which we call partial, adaptive, and selective restarts, to modify the algorithms RestartQ-UCB and RANDOMIZEDQ (Wang et al., 2025). We observe near-optimal empirical performance across multiple environments, decreasing dynamic regret by up to 91% relative to RestartQ-UCB.
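To make the "complete forgetting" issue concrete, the sketch below contrasts a complete (forget-everything) restart with a partial restart that retains a fraction of the learned Q-values. This is an assumed, simplified illustration of the idea only: it omits the confidence bonuses and epoch schedules of the actual algorithms, and every parameter name here is hypothetical rather than from the paper's code.

```python
import numpy as np

# Contrast between a complete restart (as in RestartQ-UCB, which resets
# everything to the optimistic initialization) and a partial restart,
# which keeps part of what was learned. Illustrative sketch only.

def complete_restart(Q, q_init):
    """Forget everything: reset every entry to its optimistic initial value."""
    Q[:] = q_init
    return Q

def partial_restart(Q, q_init, keep=0.5):
    """Retain a fraction `keep` of the learned values, blending the
    remainder back toward the optimistic initialization."""
    Q[:] = keep * Q + (1.0 - keep) * q_init
    return Q

# Example: 4 states x 2 actions, optimistically initialized at 10.0.
Q = np.full((4, 2), 10.0)
Q -= np.random.rand(4, 2)              # pretend some learning happened
partial_restart(Q, q_init=10.0, keep=0.5)
```

An adaptive or selective restart would additionally decide *when* (e.g., when a drift statistic exceeds a threshold) or *which entries* to reset, rather than following a fixed schedule.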